IODS stands for Introduction to Open Data Science - a MOOC platform course aiming to teach how to use open software tools and to analyse (openly available) data sets.
We will learn to use the following tools: R, RStudio, R Markdown and GitHub (including GitHub Desktop). Other resources include DataCamp and Slack.
The topics discussed during the course are: Regression and model validation, Logistic regression, Clustering and classification, Dimensionality reduction techniques, and Multivariate statistical modelling.
At the core of the course are the exercises: weekly DataCamp and RStudio exercises (including peer reviews) as well as a final assignment for which one of the topics of the course will be investigated in more detail.
Here is the link to my GitHub repository.
The dataset used in this analysis was prepared beforehand. Originally, it consisted of 60 different variables and a total of 183 observations. Several subgroups of variables were interrelated and they were combined into single variables by calculating the mean value of each row for each subgroup. Then, observations where exam points were zero were excluded.
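The two preparation steps can be sketched as follows. This is a toy illustration with made-up column names (q1, q2), since the actual preparation script is not shown here:

```r
# Toy illustration of the preparation: combine related question columns
# into one variable via row means, then drop zero-point observations.
# (Columns q1 and q2 are hypothetical stand-ins for one question group.)
raw <- data.frame(
  q1     = c(4, 3, 5),
  q2     = c(2, 3, 4),
  points = c(25, 0, 18)
)
raw$combined <- rowMeans(raw[, c("q1", "q2")])  # mean value of each row
clean <- subset(raw, points > 0)                # exclude zero-point rows
```

The same row-mean and filtering logic was applied to each interrelated subgroup of the original 60 variables.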
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data deals with different aspects of learning statistics and the students' exam points. Variables 'deep', 'stra' and 'surf' are the combined variables mentioned earlier, and they consist of questions related to the students' learning objectives: deep learning, strategic learning and surface learning. Variable 'attitude' represents the student's global attitude towards statistics. Variable 'gender' has two values, female ('F') and male ('M'), and variable 'age' records the student's age in years.
## [1] 166 7
Thus, the final dataset consists of seven variables and 166 observations. The following figure provides a visual overview of the data. It shows the variables in more detail, as well as the relationships between them, including their correlation coefficients.
Let us begin with 'gender'. As can be seen below, there are nearly twice as many women as there are men.
## F M
## 110 56
Here is a summary of the age variable as a whole,
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 21.0 22.0 25.5 27.0 55.0
as well as for the subgroups separately (first male students, then female students).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 21.0 24.0 26.8 29.0 55.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 20.0 22.0 24.9 26.0 53.0
The overall ages range from late teens to mid-fifties. Both subgroups are rather similar. 50% of the students in both gender groups are in their twenties and female students are on average younger than male students (male: 21-29 / mean 26.8, female: 20-26 / mean 24.9). Within this 50%, the median values are located closer to the younger end (male: 24, female: 22). There are also many outliers in the older end of both subgroups.
It is perhaps not surprising that male students' attitude towards statistics is slightly more positive, but here too the ranges are wide and there are exceptions.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 31.0 34.0 34.4 39.0 48.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 25.0 29.5 29.9 35.8 50.0
The central 50% of observations for women falls between 25.0 and 35.8, whereas for men it is 31.0-39.0. Although the male group appears more uniform, there are outliers in this subgroup too.
Both male and female students are in favour of deep learning. Male students' exam points are somewhat higher, but female students tend to be a bit more consistent in their results, as their box is narrower - keeping in mind, too, that there are twice as many students in this subgroup.
The strongest positive correlation (0.437) is found between attitude and points, and it is stronger for men (0.451) than for women (0.422), which is also easy to see in the kernel density graphs. Attitude also has a very weak positive overall correlation with age (0.0222) - for women (0.000756), attitudes towards statistics become very slightly more positive with age, but the opposite holds for men (-0.0414).
Both deep learning and strategic learning have a positive correlation with age, while surface learning has a negative one. As one gets older, one takes studying more seriously while being more selective in the process. Unfortunately, and somewhat surprisingly, age has a slightly negative correlation with points.
The next step is to fit a regression model, i.e. to try to explain the behaviour of the dependent variable (y) - in this case exam points. The explanatory variables were selected based on the findings in the previous section, and the regression model can be stated as points ~ attitude + stra + age. Let us look at a summary of this model:
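Judging from the Call line in the summary below, the model was fitted along these lines (the object name my_model is my own choice):

```r
# Multiple linear regression of exam points on attitude,
# strategic learning and age (data frame name as in the Call line)
my_model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(my_model)
```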
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.11 -3.20 0.33 3.41 10.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.8954 2.6483 4.11 6.2e-05 ***
## attitude 0.3481 0.0562 6.19 4.7e-09 ***
## stra 1.0037 0.5343 1.88 0.062 .
## age -0.0882 0.0530 -1.66 0.098 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.218, Adjusted R-squared: 0.204
## F-statistic: 15.1 on 3 and 162 DF, p-value: 1.07e-08
The summary reveals that attitude is in fact a valid predictor of students' success in their exams. The null hypothesis of each t-test is that the corresponding coefficient (beta) equals zero. Here, the estimated beta for attitude is 0.3481 with a standard error of 0.0562, meaning that for a one-unit increase in attitude, exam points are expected to increase by 0.3481. The p value of the t-test is very low (4.7e-09), so the rejection of the null hypothesis (beta = 0) is highly significant.
As for strategic learning and age, their p values are further away from zero and thus significant only at a significance level of 0.1. The beta value of stra is higher than that of age. As age has the highest p value, I will remove it from the model in order to see how that affects the model. Here is a summary of the resulting model lm(y ~ x1 + x2), where attitude and strategic learning are used as explanatory variables for exam points.
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.644 -3.311 0.558 3.793 10.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.75 0.00025 ***
## attitude 0.3466 0.0565 6.13 6.3e-09 ***
## stra 0.9137 0.5345 1.71 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.29 on 163 degrees of freedom
## Multiple R-squared: 0.205, Adjusted R-squared: 0.195
## F-statistic: 21 on 2 and 163 DF, p-value: 7.73e-09
Now that age has been removed, the beta values of both attitude and strategic learning have somewhat decreased (attitude: 0.3481 → 0.3466, stra: 1.0037 → 0.9137), while the standard errors remain next to unaltered (attitude: 0.0562 → 0.0565, stra: 0.5343 → 0.5345). The p values have also changed a little, but their interpretations are the same as in the first model.
Based on these test results for the individual variables, it looks as if the model now fits better, but the model as a whole still needs to be considered. The multiple R-squared measures the goodness of fit of the model: if the value is 1 (the maximum), the regression model fits the observations exactly and there are no residuals. Here, the multiple R-squared value is 0.205, which means that the model explains 20.5% of the variation in exam points. In the first model (with three explanatory variables) the multiple R-squared value was higher (0.218), which illustrates that adding explanatory variables can only increase the R-squared value. The adjusted R-squared accounts for this by penalizing less significant variables and is therefore lower than the multiple R-squared: its value was 0.204 in the first model, and having removed one variable, it is now 0.195.
Although the R-squared values for a model with two explanatory variables will be lower than for a model with three variables, they are very similar and in both cases the explanatory power of the models is roughly 20% which indicates that the variables in this dataset alone do not explain how well a student does or does not do in their exam. Should I venture a guess, I would say that the hours spent studying have a more decisive impact on the results.
The F test, in turn, considers the entire model instead of individual variables, and in both cases its p value is extremely small. The critical values of F with (3, 162) / (2, 163) degrees of freedom are roughly 3.78 / 4.61 at the 1 percent significance level, so the F statistics of 15.1 / 21 indicate significant explanatory power in both models.
In the previous section, a regression model was fitted and investigated as to how well it fitted the observations. The purpose of this section is to look at the model from the point of view of the residuals, which reveal the shortcomings of the model. They turn our attention to what was left over after the model was fitted and could reveal patterns which affect the fitted model. Here, I have chosen to investigate the latter model: points ~ attitude + stra.
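In base R, the three diagnostic plots discussed below can be produced directly from the fitted model object (the object name my_model2 is my own; the which values select the individual plots):

```r
# Diagnostic plots: 1 = residuals vs fitted, 2 = normal QQ,
# 5 = residuals vs leverage
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
par(mfrow = c(1, 3))
plot(my_model2, which = c(1, 2, 5))
```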
Let us begin by looking at residuals versus fitted values which will show if residuals have non-linear patterns, i.e. if the variance of residuals is not constant.
The residuals are rather nicely spread around the horizontal line and there are no distinct patterns. Some observations (35, 56, 145) lie further away from the line and would therefore be worth investigating more closely, but other than those, there do not appear to be any major issues with the model.
Errors (residuals) in the regression model are assumed to be normally distributed, and the QQ-plot shows if this is the case for the fitted model.
Again, the residuals do not fall exactly on the straight line, especially at the beginning and the end. In general, however, the plot does not reveal any major concerns, although observations 35, 56 and 145 stand out again.
Finally, let us take a look at the residuals versus leverage. The following plot shows if there are influential observations, i.e. how much impact single observations have on the model.
In this plot, we are not looking for patterns, but rather for outlying values in the upper or lower right corners, as these would be the ones that influence the regression model. As we can see, there are none in this plot. Observations 35, 71 and 145 are marked, but as they lie within the Cook's distance lines (which actually fall outside the plot area), they are not influential and the model appears to have a reasonably good fit.
However, the same numbered outliers appeared in all of these plots. The next step would be to take a closer look at them: they might be meaningful, or simply errors in the data.
The dataset used in this chapter contains combined information about the alcohol consumption of students on both mathematics and Portuguese courses. The original datasets are available at the UCI Machine Learning Repository.
The combined dataset consists of 35 variables and 382 observations.
## [1] 382 35
In addition to alcohol consumption, the data provides a variety of background information for each student.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The information provided by the variables can be summarised as follows:
The original variables were complemented with two new variables:
In order to explain what determines the level of alcohol consumption (high/low), we need to consider the other variables and how they might relate to it.
I expect that what we are dealing with is a group of young adults beginning to experiment with alcohol. I believe that at this age, especially at the younger end, the key factors behind alcohol use are peer pressure and the amount of parental supervision. But it must also be kept in mind that the situation of a 15-year-old student can be very different from that of a 22-year-old student. For this reason, we must factor in related variables which may not directly cause the level of alcohol consumption but correlate with it, such as academic behaviour. Time spent studying and grades do not seem suitable factors, since they depend heavily on the student's intelligence and only to a lesser extent on their alcohol consumption. On the other hand, time spent partying and drinking does result in fatigue (to say the least) and could therefore be reflected in attendance.
Based on this reasoning, I have selected the following four variables:
| Factor | Variable | Description of variable |
|---|---|---|
| age | age | student’s age |
| peer pressure | goout | going out with friends |
| parental supervision | famrel | quality of family relations |
| academic success | absences | number of school absences |
Let us next explore the selected variables beginning with age.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 16.0 17.0 16.6 17.0 22.0
## [1] 0.113
The age range is from 15 to 22 and the Pearson correlation between age and high_use is 0.113. Older students tend to use more alcohol (high_use = TRUE) than younger ones. It is interesting to note that for high consumption the median age is lower for girls than for boys. In fact, 50% of the girls consuming a lot of alcohol are 16-17 years old, whereas the boys' range is slightly wider. Both sexes start experimenting around the same age, but girls "peak" earlier, so to speak, and maintain that level for a longer time.
Next we have time spent with friends, “goout”, where value 1 equals very low and 5 equals very high.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.11 4.00 5.00
## [1] 0.354
The figure above shows a clear difference between the consumption levels. Those in the high consumption group go out more often than those in the low consumption group, suggesting that peer pressure does indeed have an effect on alcohol consumption. Comparison between the two consumption levels reveals an interesting gender difference as well. Girls in the low consumption group go out more than boys, even as much as those in the high consumption group. For boys it is either-or: boys who do not go out a lot do not drink much, and vice versa. Thus, girls spend a lot of time with their friends, but alcohol consumption is not as important a factor in building their social relations as it is for adolescent boys. Also, the correlation is now 0.354, which is higher than for age.
Next, let us zoom out a bit and look at the quality of family relationships where the values of the variable “famrel” are like those of our previous variable (1 = very bad, …, 5 = excellent).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 4.00 3.94 5.00 5.00
## [1] -0.113
With this variable there is a negative correlation of -0.113, meaning that the better the relationships within the family, the less likely it is that the offspring will consume high levels of alcohol. But the difference is not very sharp, as the figure reveals. The majority of students at both consumption levels appear to have relatively good relationships with their families, and although low consumption corresponds a bit better with very good family relationships, the medians are, surprisingly, the same (4) for both levels.
Absences, on the other hand, are not quite as clearly differentiated as I had anticipated.
## [1] 0.223
The correlation value is 0.223, and the figure shows that absences are a little more frequent in the high consumption group. But the medians are almost identical, and girls have more outliers in both consumption groups. Any meaningful differences can again be seen by looking at the boys: low consumption corresponds to very few absences, fewer than for girls, whereas high consumption corresponds to more absences than for girls.
I have now fitted a logistic regression model between the variables discussed in the previous section (explanatory variables) and the level of alcohol consumption, which is, as we have seen, a binary variable (high/low). Below is a summary of the fitted model.
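The model was presumably fitted as follows (the data frame name alc comes from the Call line of the summary below; the object name m is my own):

```r
# Logistic regression: binary high_use explained by the four chosen variables
m <- glm(high_use ~ age + goout + famrel + absences,
         data = alc, family = "binomial")
summary(m)
```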
##
## Call:
## glm(formula = high_use ~ age + goout + famrel + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.804 -0.756 -0.533 0.890 2.382
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9196 1.8543 -2.11 0.03453 *
## age 0.0966 0.1096 0.88 0.37770
## goout 0.7582 0.1206 6.29 3.2e-10 ***
## famrel -0.3564 0.1356 -2.63 0.00857 **
## absences 0.0721 0.0218 3.30 0.00095 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 395.01 on 377 degrees of freedom
## AIC: 405
##
## Number of Fisher Scoring iterations: 4
Looking at the summary, we can see that apart from age, the selected explanatory variables are significant. Time spent with friends and absences from school are the most significant ones, with p values very close to zero. Although age by itself does not score very high in this model, the model as a whole is significant at the 0.01 significance level.
Let us next look at the odds ratio (OR) and confidence intervals (CI) of the fitted model.
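For a logistic model, the odds ratios are the exponentiated coefficients, and the intervals are the exponentiated confidence limits (assuming the fitted model object is called m):

```r
OR <- exp(coef(m))      # odds ratios
CI <- exp(confint(m))   # 95% profile-likelihood confidence intervals
cbind(OR, CI)
```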
## OR 2.5 % 97.5 %
## (Intercept) 0.0198 0.000496 0.726
## age 1.1015 0.888842 1.367
## goout 2.1345 1.695118 2.723
## famrel 0.7002 0.535323 0.913
## absences 1.0747 1.030941 1.124
A one-unit increase in time spent with friends ('goout') roughly doubles the odds (OR = 2.1345) of a high level of alcohol consumption, as opposed to a low level. This complies well with the model's test results above. The odds ratios for absences and family relationships are likewise in line with the test results. Interestingly, although age was not as significant as the other variables, its odds ratio (1.1015) is actually higher than that of absences (1.0747). However, the 95% confidence interval reveals that age is not statistically significant at the 0.05 level (the interval contains 1.0), whereas the other variables are.
In order to explore the predictive power of my model to its full extent, I have modified it by including only those variables which were found to be statistically significant, i.e. I have dropped the age variable. The logistic regression equation is now:
\(y = -2.3882 + 0.7707(goout) - 0.3499(famrel) + 0.0745(absences)\)
Based on this modified model, you will find below a 2x2 cross tabulation of the predictions versus actual values of the level of alcohol consumption (TRUE being ‘high’ and FALSE being ‘low’). The prediction threshold has been set to 0.5.
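A sketch of how the table below can be produced (the object name m2 and the probability/prediction column names are my own):

```r
# Refit without age, predict probabilities, classify at the 0.5 threshold
m2 <- glm(high_use ~ goout + famrel + absences,
          data = alc, family = "binomial")
alc$probability <- predict(m2, type = "response")
alc$prediction  <- alc$probability > 0.5
# cross tabulation of actual vs predicted, as proportions with margins
addmargins(prop.table(table(high_use = alc$high_use,
                            prediction = alc$prediction)))
```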
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.6387 0.0628 0.7016
## TRUE 0.1806 0.1178 0.2984
## Sum 0.8194 0.1806 1.0000
The model predicts a low level of alcohol consumption for about 82% of the students, but only about 64 percentage points of those predictions are correct. Correspondingly, it predicts a high level of consumption for about 18% of the students, of which only about 12 percentage points are correct.
We have now considered the model from the point of view of correctly predicted values, but we still need to find a measure for the accuracy of the model as a whole. This can be done by calculating the mean of incorrect predictions (penalty).
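The penalty can be computed with a simple 0/1 loss function, for instance like this (this mirrors the loss function used in the DataCamp exercises; the probability column is assumed to hold the model's predicted probabilities):

```r
# Mean of incorrect predictions: a prediction counts as wrong whenever
# the predicted class (probability > 0.5) differs from the actual class
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}
loss_func(class = alc$high_use, prob = alc$probability)
```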
## [1] 0.243
So, the lower the penalty the better (no model will ever be completely accurate). The value here (0.243) seems reasonably low and in any case it is lower than the value attained from the model used in the DataCamp exercises (0.256).
Continuing with the modified model, I will next perform a 10-fold cross-validation in order to test its actual predictive power.
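The cross-validation can be run with cv.glm from the boot package (assuming the modified model is stored as m2; the cost function is the share of incorrect predictions):

```r
library(boot)

# 0/1 loss used as the cost function for cross-validation
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]   # average prediction error over the 10 folds
```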
## [1] 0.249
Compared to the model used in DataCamp, it appears that my model performs better (< 0.262). Both models have three explanatory variables, but they have only one in common: 'absences'. The difference in performance can only be explained by differences in the models. The DataCamp model considers academic underachievement ('failures') and gender ('sex'), whereas I decided not to include academic performance due to its dependence on other factors (e.g. intelligence) and instead investigated the influence of peer pressure ('goout') and parental involvement ('famrel'). It would appear that my line of reasoning pointed me in the right direction.
The question still remains: is my model, although modified, the best possible one? Since only four variables were originally chosen, there are still over 30 variables which have not been put to the test. In this section I will approach modelling from another perspective: by favouring statistical measures and iteration over personal reasoning.
N.B. I will use K-fold cross-validation and display its results, but the model summaries will not be shown for the sake of legibility.
Let us begin with a wide variety of possibly relevant variables.
## [1] 0.272
The error is now higher than in any of the earlier models but this was expected - some variables are better predictors than others.
The summary of this model revealed that there are more non-significant variables than significant ones. I will now drop all variables which are not significant at the 0.05 level for any of their values.
## [1] 0.241
As we can see, there is a vast improvement and this model has given us the best result yet. There are still nine variables left, so surely not all of them are valuable to the model.
There are actually several which can be dropped: ‘reason’ (although one value is somewhat significant, the others really are not), ‘paid’, ‘freetime’ and ‘health’. This leaves us with a total of five remaining variables.
## [1] 0.23
Again, the model improved, but why stop here?
I will remove yet another variable ('address') which, although significant at 0.05, is the least significant. Now we are left with four variables: 'studytime', 'goout', 'famrel' and 'absences'. Let us see what happens.
## [1] 0.238
As it turns out, the result is worse. Thus, the previous model was the best one. I am happy to note that it included all of the variables from my modified model (‘goout’, ‘famrel’, ‘absences’). On the other hand, my hypothesis about time spent studying not being a good measure was wrong, and I had not considered the effect of urban versus rural (‘address’) living at all which, of course, does determine one’s mobility and thus social environment.
This week I am using a built-in dataset, namely the Boston dataset, which contains housing information for the Boston, Massachusetts area. The data was collected by the U.S. Census Service and it is also available online.
The Boston dataset consists of 506 observations of 14 variables, all of them numeric. Let us take a closer look at what they are.
The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly list them here with a short description.
| Variable | Description |
|---|---|
| crim | crime rate (per capita) |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres |
| chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio |
| black | proportion of blacks |
| lstat | % lower status of the population |
| medv | median value of owner-occupied homes in $1000’s |
Thus, the Boston dataset consists of a variety of information: demographic, economic, and environmental factors as well as safety. I will not discuss the variables in more detail just now, but save it for later when we explore the standardized dataset - the comparison between the original and scaled variables will be more meaningful if the summaries are presented together.
So, let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.
Here we see some interesting distribution patterns, such as
* accumulation close to the edges and corners of the box - e.g. age/lstat
* more curved shapes close to the lower left corner - e.g. nox/dis
* compact round shapes close to one of the corners - e.g. rad/tax
* binary patterns in all pairs involving the variables chas and rad.
Another way of approaching the relationships between variables is to look at their correlations. Below is a correlogram where, instead of numerical values, the correlations are presented as coloured spots. Red indicates a negative correlation and blue a positive one. The stronger the correlation, the bigger the spot.
Three pairs stand out as having a strong negative correlation: indus/dis, nox/dis and lstat/medv. It is interesting to note the connection between this and the previous graph. If we go back to the previous graph, we can see that these pairs are those which showed curved patterns.
As for positive correlations, there are many to choose from but one stands out from the rest: rad/tax. This was also identified above and it is the one with a compact round shape.
The first step before attempting actual classification is to standardize the dataset. This is because we need the variables to be comparable and to satisfy the assumptions on which the classification method is based (the variables are normally distributed and share the same variance).
Standardization is a simple operation in which the values of each variable are calculated using the following formula:
\(scaled(x) = [x-mean(x)]/sd(x)\)
i.e. the variable mean is subtracted from the original value of the variable and the resulting difference is then divided by the standard deviation of the variable.
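In R this is a one-liner, since scale() applies exactly this formula column by column (the Boston data comes with the MASS package; the name boston_scaled is my own):

```r
library(MASS)                                  # provides the Boston dataset
boston_scaled <- as.data.frame(scale(Boston))  # (x - mean(x)) / sd(x) per column
round(colMeans(boston_scaled), 10)             # every mean is now (numerically) zero
```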
Now let us take a closer look at a selection of these scaled variables and compare them to the original, non-scaled, dataset (original values are shown first and then the scaled values). Here are summaries of age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.9 45.0 77.5 68.6 94.1 100.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.330 -0.837 0.317 0.000 0.906 1.120
tax (full value property tax rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187 279 330 408 666 711
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.310 -0.767 -0.464 0.000 1.530 1.800
and black (proportion of blacks in the area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 375 391 357 396 397
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.90 0.20 0.38 0.00 0.43 0.44
The standardization procedure has affected the variables considerably. To begin with, the range has been radically reduced: for age [2.9,100.0] -> [-2.333,1.116], for tax [187,711] -> [-1.313,1.796], and for black [0,397] -> [-3.90,0.44]. What is most noteworthy, though, is the change in the mean values - in the scaled dataset, the mean value is 0.000 for all variables. The variables are now aligned and thus more easily comparable.
I will use this scaled dataset from this point onwards. There are still some minor adjustments to be made: in the analysis to come, I will use crime rate as the target variable, and for that the continuous variable crim needs to be converted into a new categorical variable crime (after which the original variable is removed, since all the other variables will serve as explanatory variables). The quantiles of the scaled variable crim
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.42 -0.41 -0.39 0.00 0.01 9.92
will serve as the breakpoints for the new categorical variable crime - the table below shows its distribution between the newly created categories.
## crime
## low med_low med_high high
## 127 126 126 127
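A sketch of the recoding, with category names as in the table above (the scaled data frame is assumed to be called boston_scaled):

```r
# Use the quantiles of the scaled crim variable as breakpoints
bins  <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim  <- NULL    # drop the original continuous variable
boston_scaled$crime <- crime
table(boston_scaled$crime)
```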
Creating a model from your data is interesting by itself, but it falls short if there is nothing to test it on. Luckily, we can divide the dataset into training and test sets where the former is used for creating the model and the latter is used as new data for testing the model’s prediction power. I will use 80% of my data for training and the remaining 20% for testing.
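One way to do the split: sample 80% of the row indices at random for the training set and keep the rest for testing (boston_scaled is assumed to be the scaled data frame with the crime variable).

```r
n   <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))   # random 80% of the row numbers
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]            # the remaining 20%
```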
Now that the data has been prepared and divided, we can fit the linear discriminant analysis (LDA) on the previously created training set. LDA is a classification method where the classes are known beforehand. In our case they are the previously created four categories denoting different crime levels.
LDA finds a linear combination of the explanatory variables which best characterizes or separates the different classes of the target variable. In this current case, the categorical variable crime is the target variable and all other variables are the explanatory, or predictor, variables. Here is a scatterplot of the LDA where different colours represent the different classes. The length and direction of the arrows show the impact that each of the predictor variables has in the model.
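The fit itself is a single call to lda() from the MASS package, and plot() draws the scatterplot of the first two discriminants (the predictor arrows in my figure were added with a small helper function, not shown here):

```r
library(MASS)

# crime as the target, all other variables as predictors
lda.fit <- lda(crime ~ ., data = train)

# scatterplot of the first two linear discriminants, coloured by class
plot(lda.fit, dimen = 2, col = as.numeric(train$crime))
```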
It appears that accessibility to radial highways (rad) is associated with a high crime rate (in blue), and this relationship is more prominent than those between the other predictors and the lower crime rates. Although the medium-high crime rate (in green) is almost completely separated from the high crime rate, it is a somewhat more uniform class than the lower crime rates (in black and red) and has an association with environmental factors (nox).
Next, we can use the test set to predict the classes in this new data and thus assess the model's performance. Here is a cross tabulation of the actual vs predicted values.
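The predictions and the table can be obtained as follows (assuming the fitted model is stored as lda.fit and the test set as test):

```r
lda.pred <- predict(lda.fit, newdata = test)    # classify the test set
addmargins(table(correct   = test$crime,
                 predicted = lda.pred$class))   # actual vs predicted counts
```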
## predicted
## correct low med_low med_high high Sum
## low 14 8 1 0 23
## med_low 4 15 9 0 28
## med_high 0 3 23 0 26
## high 0 0 0 25 25
## Sum 18 26 33 25 102
Let us begin with the high crime rate: all predictions for this class are correct ones. Medium high crime rate is also most often predicted correctly. On the other hand, the predictions for low and medium low crime rates are not as accurate. These results confirm what we saw above in connection with the biplot.
The aim is to create clusters (i.e. groups) of the observations. This can be done by adopting the k-means method which assigns the observations into k clusters based on the similarity of the objects. Similarity can in turn be measured in terms of distances. The following analysis is based on the Euclidean measure. Below is a summary of those distances.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.13 3.46 4.82 4.91 6.19 14.40
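The distances summarised above can be computed with dist(), whose default measure is already the Euclidean one. A sketch, assuming the clustering is run on the re-scaled, purely numeric Boston data:

```r
library(MASS)
boston_km <- scale(Boston)                        # scaled numeric data
dist_eu   <- dist(boston_km, method = "euclidean")
summary(as.vector(dist_eu))                       # five-number summary plus mean
```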
Now we can run the k-means algorithm. Because I do not yet know how many clusters I should use, I will begin by setting the number at 10, which should be a relatively safe bet.
Fortunately, we do not need to rely on guessing, as there are more sophisticated ways of finding the optimal number of clusters. Here, I will use the total within-cluster sum of squares (WCSS) to do so. WCSS measures how close the observations within each cluster are to the cluster's centre; the optimal number of clusters is the one at which the value drops sharply. In the line graph below we can see that the drop occurs when the number of clusters is two.
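A sketch of the elbow search: compute the total WCSS for one to ten clusters and plot it (kmeans starts from random centres, so a seed keeps the result reproducible; boston_km is assumed to be the scaled numeric data):

```r
set.seed(123)
boston_km <- scale(MASS::Boston)
twcss <- sapply(1:10, function(k) kmeans(boston_km, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")
```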
I will now run the k-means algorithm again with the number of clusters set at two. The figure below is similar to the representation of pairs we saw earlier, but here it is the scaled variables that are plotted against each other. Also, instead of the original values, the patterns seen here represent the similarities (or differences), i.e. distances, between the variables.
Disregarding the colours for a brief moment, the overall distribution patterns appear nearly identical to those we saw earlier. However, now that the clusters are included and shown in different colours, we can see how the patterns relate to the clusters. The differences between the clusters are not always clear, but for many pairs they reveal certain unique, or even binary, patterns. The differences are clearest for all pairs involving the target variable crim (crime rate) and chas (dummy variable), but also for rad (accessibility to radial highways) and tax (full-value property-tax rate).
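The two-cluster run and the coloured pairs plot discussed above can be sketched like this (seed and object names are my own):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# k-means with two clusters on the scaled data
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# pairs plot of the scaled variables, coloured by cluster membership
pairs(boston_scaled, col = km$cluster)
```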
The icing on the cake is the following 3D plot. Not only does it look really good, but you can also move it around. It shows the same four classes as in the LDA plot seen earlier, and here too the high crime rate (in yellow) stands apart from the other crime levels.
Go to: top of this section or top of diary.
This week’s so-called ‘human’ dataset originates from the United Nations Development Programme. The current dataset consists of Human Development Index (HDI) data and Gender Inequality Index (GII) data for most countries around the world. HDI and GII were joined together using country as the key. Two new variables were added and selected alongside a handful of other variables for the final dataset, which consists of 8 variables and 155 observations. Here are the first observations in the dataset.
## GNI life_exp edu_exp mat_mort adol_birth parl_rep edu2_FM
## Norway 64992 81.6 17.5 4 7.8 39.6 1.007
## Australia 42261 82.4 20.2 6 12.1 30.5 0.997
## Switzerland 56431 83.0 15.8 6 1.9 28.5 0.983
## Denmark 44025 80.2 18.7 5 5.1 38.0 0.989
## Netherlands 45435 81.6 17.9 6 6.2 36.9 0.969
## Germany 43919 80.9 16.5 7 3.8 36.9 0.993
## lab_FM
## Norway 0.891
## Australia 0.819
## Switzerland 0.825
## Denmark 0.884
## Netherlands 0.829
## Germany 0.807
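The join described above can be sketched with base R’s merge() (the course material typically uses dplyr’s inner_join(); the tiny hd and gii data frames here are illustrative stand-ins for the full HDI and GII tables):

```r
# illustrative stand-ins for the HDI and GII data, keyed by Country
hd  <- data.frame(Country = c("Norway", "Australia"), HDI = c(0.944, 0.935))
gii <- data.frame(Country = c("Norway", "Australia"), GII = c(0.067, 0.110))

# inner join on the country key: only countries present in both remain
human <- merge(hd, gii, by = "Country")
human
```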
The variables for each country are:
| Variable | Description |
|---|---|
| GNI | Gross National Income per capita |
| life_exp | life expectancy at birth |
| edu_exp | expected years of schooling |
| mat_mort | maternal mortality ratio |
| adol_birth | adolescent birth rate |
| parl_rep | % of female representatives in parliament |
| edu2_FM | ratio of female and male populations with secondary education |
| lab_FM | ratio of labour force participation of females and males |
Below are summaries of these variables.
## GNI life_exp edu_exp mat_mort
## Min. : 581 Min. :49.0 Min. : 5.4 Min. : 1
## 1st Qu.: 4198 1st Qu.:66.3 1st Qu.:11.2 1st Qu.: 12
## Median : 12040 Median :74.2 Median :13.5 Median : 49
## Mean : 17628 Mean :71.7 Mean :13.2 Mean : 149
## 3rd Qu.: 24512 3rd Qu.:77.2 3rd Qu.:15.2 3rd Qu.: 190
## Max. :123124 Max. :83.5 Max. :20.2 Max. :1100
## adol_birth parl_rep edu2_FM lab_FM
## Min. : 0.6 Min. : 0.0 Min. :0.172 Min. :0.186
## 1st Qu.: 12.7 1st Qu.:12.4 1st Qu.:0.726 1st Qu.:0.598
## Median : 33.6 Median :19.3 Median :0.938 Median :0.753
## Mean : 47.2 Mean :20.9 Mean :0.853 Mean :0.707
## 3rd Qu.: 72.0 3rd Qu.:27.9 3rd Qu.:0.997 3rd Qu.:0.854
## Max. :204.8 Max. :57.5 Max. :1.497 Max. :1.038
The following graph provides a visual overview of the data. It shows the distributions of each variable, as well as the relationships between different variables, including their correlation coefficients.
Gross National Income GNI has a fairly strong positive correlation with life_exp (0.627) and edu_exp (0.624), which seems logical: the wealthier a country is, the better the standard of living its people tend to have, i.e. they are likely to live healthier and thus longer lives, and they have better educational opportunities. Interestingly, life_exp and edu_exp have the strongest positive correlation (0.789). Close behind are the adolescent birth rate adol_birth and the maternal mortality ratio mat_mort (0.759).
Let us take a closer look at these pairs.
There is a lot of variation in the lower end of GNI in both cases (top row). In the lower row pairs, the overall trend resembles a linear correlation, but there is a lot of variation in both.
As for negative correlations, the maternal mortality ratio mat_mort is a factor in the top two pairs: with life expectancy at birth life_exp (-0.857) and expected years of schooling edu_exp (-0.726).
Both life expectancy and expected years of schooling are fairly high when the maternal mortality rate is very low, and their values drop fairly quickly as the maternal mortality ratio increases.
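The correlation coefficients quoted above come from the base-R cor() function (the overview figure itself was likely drawn with GGally::ggpairs(), which shows the same coefficients alongside the scatter plots). As a self-contained sketch, the built-in state.x77 matrix, which mixes similar kinds of variables, stands in for the human data here:

```r
# correlation matrix rounded to three decimals; state.x77 is a stand-in
# for the course's human data, so the values differ from those in the text
cor_matrix <- round(cor(state.x77), 3)
cor_matrix["Income", "Life Exp"]
```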
It is time to perform principal component analysis (PCA), for which I will use Singular Value Decomposition (SVD). The idea behind PCA is to reduce the number of dimensions in order to find the underlying principal components.
I will begin with the original dataset, and use the first two principal components to draw a biplot of the PCA. The number of warnings visible below is the first clue that this is perhaps not the best possible approach.
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Next, I will standardize the dataset and repeat the previous steps drawing a new biplot.
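The two analyses can be sketched as below. The human data is not rebuilt here, so the built-in state.x77 matrix serves as a stand-in; it mixes scales in the same way (population counts, dollars, years), so it shows the same effect: without standardization one large-variance variable dominates PC1.

```r
# stand-in for the human data: a numeric matrix with very different scales
x <- state.x77

pca_raw <- prcomp(x)         # no scaling: large-variance columns dominate
pca_std <- prcomp(scale(x))  # standardized: each variable mean 0, sd 1

# proportion of variance explained by each principal component
round(summary(pca_raw)$importance["Proportion of Variance", ], 3)
round(summary(pca_std)$importance["Proportion of Variance", ], 3)

# biplot of the first two components of the standardized PCA
biplot(pca_std, cex = c(0.6, 0.9))
```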
The results are very different depending on whether the non-standardized or the standardized dataset is used. In the former case, all of the explanatory weight is on the first principal component (PC1 = 100%), and it highlights just one variable, GNI, as differing from the others. The GNI arrow is completely flat, which indicates a very high positive correlation with the first component. As can be seen from the warnings above, the other variables have so little significance that their arrows cannot even be drawn. This results in the pink mess in the biplot where the other variables are stacked on top of each other. So, this analysis would suggest that GNI is the only feature which can be used to explain differences between countries. The GNI arrow is also very long, meaning that the variable has a high standard deviation, which is true, but the problem is that the dataset has not been standardized. The original dataset contains variables operating on different scales (years, monetary units, ratios). GNI is a monetary unit and it dominates because its variation is so much higher than that of the other variables.
In the second biplot, we see the results of the same analysis on the standardized dataset. The results are very different due to the standardization. The first principal component is still the most significant one (53.6%), but now the second component also carries some weight (16.2%). Furthermore, three different patterns can be seen, and GNI is part of only one of them. It is very interesting that the variables grouped together are those that were identified earlier when correlations were discussed.
On the first principal component dimension we have two opposing groups, which could be summarized as wealth and health versus mortality. In the first group, Gross National Income per capita (GNI), life expectancy at birth (life_exp), expected years of education (edu_exp) and the ratio of female and male populations with secondary education (edu2_FM) are identified as strongly correlating with each other. Opposed to them is mortality: the adolescent birth rate (adol_birth) correlating with the maternal mortality ratio (mat_mort).
On the second dimension, and independent of the previous issues, we have measures for gender equality (or inequality), represented by two fairly correlated features: the proportion of female representatives in parliament (parl_rep) and the ratio of labour force participation of females and males (lab_FM). So, a high GNI does not say anything about how women are represented in a country’s labour force. For example, the countries on the Arabian Peninsula (Qatar, Kuwait, Saudi Arabia, etc.) are among the wealthiest in the world (per capita), but their labour forces, not to mention governmental positions, are (nearly) completely dominated by men.
In this section I will use the tea data included in the FactoMineR package.
There are 36 variables and 300 observations. Except for age, all the variables are categorical.
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The data consists of answers to a questionnaire on tea consumption, charting how and when people drink tea and what their perceptions of tea are; it also contains a few personal details (e.g. age, sex) of the respondents.
As we can see, visualizing this number of variables is possible, but perhaps not overly informative. Next, I will proceed with Multiple Correspondence Analysis (MCA), which is similar to the PCA done above in that it reduces dimensions and is thus able to identify underlying patterns and structures in the data. Here I am using the original dataset, but excluding age as it is not a categorical variable. Below is a summary of this analysis.
##
## Call:
## MCA(X = tea_notage, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.090 0.082 0.070 0.063 0.056 0.053
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.050 0.048 0.047 0.044 0.041 0.040
## % of var. 3.272 3.090 3.053 2.834 2.643 2.623
## Cumulative % of var. 30.091 33.181 36.234 39.068 41.711 44.334
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.039 0.037 0.036 0.035 0.034 0.032
## % of var. 2.531 2.388 2.302 2.275 2.172 2.085
## Cumulative % of var. 46.865 49.252 51.554 53.829 56.000 58.086
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.031 0.031 0.030 0.028 0.027 0.026
## % of var. 2.013 2.011 1.915 1.847 1.740 1.686
## Cumulative % of var. 60.099 62.110 64.025 65.872 67.611 69.297
## Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
## Variance 0.025 0.025 0.024 0.024 0.023 0.022
## % of var. 1.638 1.609 1.571 1.524 1.459 1.425
## Cumulative % of var. 70.935 72.544 74.115 75.639 77.099 78.523
## Dim.31 Dim.32 Dim.33 Dim.34 Dim.35 Dim.36
## Variance 0.021 0.020 0.020 0.019 0.019 0.018
## % of var. 1.378 1.322 1.281 1.241 1.222 1.152
## Cumulative % of var. 79.901 81.223 82.504 83.745 84.967 86.119
## Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48
## Variance 0.014 0.013 0.012 0.011 0.011 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796
## Dim.49 Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.010 0.009 0.009 0.008 0.007 0.006
## % of var. 0.660 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 97.456 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281
## ctr cos2
## 1 0.013 0.001 |
## 2 0.127 0.021 |
## 3 0.071 0.009 |
## 4 0.002 0.000 |
## 5 0.000 0.000 |
## 6 0.002 0.000 |
## 7 0.095 0.015 |
## 8 0.038 0.006 |
## 9 0.553 0.080 |
## 10 0.374 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386
## Dim.3 ctr cos2 v.test
## breakfast | -0.107 0.225 0.011 -1.784 |
## Not.breakfast | 0.099 0.208 0.011 1.784 |
## Not.tea time | 0.062 0.069 0.003 0.950 |
## tea time | -0.048 0.054 0.003 -0.950 |
## evening | 0.344 1.653 0.062 4.301 |
## Not.evening | -0.180 0.864 0.062 -4.301 |
## lunch | 0.240 0.343 0.010 1.719 |
## Not.lunch | -0.041 0.059 0.010 -1.719 |
## dinner | 0.796 1.805 0.048 3.777 |
## Not.dinner | -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
The eigenvalue measures the variance retained in each dimension. There is a total of 54 dimensions, and each dimension’s variance is smaller than the previous one’s. As for the categorical variables list, all values in all three dimensions are generally low, i.e. no single variable has a significant connection with any of the dimensions.
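The analysis summarized above can be sketched as follows, assuming the tea data shipped with FactoMineR and dropping the age column (tea_notage is my own name, matching the Call line in the output above):

```r
library(FactoMineR)

# load the tea questionnaire data bundled with FactoMineR
data(tea)

# drop the only non-categorical variable (age) before running MCA
tea_notage <- tea[, names(tea) != "age"]

mca <- MCA(tea_notage, graph = FALSE)

# eigenvalues plus the first 10 individuals and categories, as above
summary(mca, nbelements = 10)
```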
Here is a biplot of the analysis. Since so many variables were included, the resulting plot is frankly a great big mess, and not at all helpful.
I will now reduce the number of variables and select those which I think could be the most important ones, based on my own experience. So, the following analysis contains these variables: Tea, How, how, where, sugar, price, relaxing and home. Here is a summary of their MCA.
##
## Call:
## MCA(X = tea_habit, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.267 0.214 0.169 0.160 0.145 0.135
## % of var. 12.557 10.082 7.970 7.520 6.829 6.365
## Cumulative % of var. 12.557 22.640 30.610 38.131 44.959 51.324
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.128 0.126 0.121 0.112 0.107 0.100
## % of var. 6.026 5.910 5.706 5.266 5.042 4.705
## Cumulative % of var. 57.350 63.259 68.965 74.232 79.274 83.978
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17
## Variance 0.086 0.082 0.069 0.063 0.041
## % of var. 4.024 3.851 3.238 2.987 1.922
## Cumulative % of var. 88.003 91.853 95.091 98.078 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.398 0.198 0.040 | 0.326 0.165 0.027 | 0.398
## 2 | -0.134 0.022 0.012 | -0.117 0.021 0.009 | 0.590
## 3 | -0.288 0.104 0.117 | -0.067 0.007 0.006 | 0.027
## 4 | -0.404 0.204 0.225 | -0.009 0.000 0.000 | -0.249
## 5 | -0.288 0.104 0.117 | -0.067 0.007 0.006 | 0.027
## 6 | -0.477 0.284 0.105 | 0.292 0.132 0.039 | 0.148
## 7 | -0.288 0.104 0.117 | -0.067 0.007 0.006 | 0.027
## 8 | -0.189 0.045 0.025 | -0.118 0.022 0.010 | 0.715
## 9 | 0.519 0.337 0.138 | -0.575 0.514 0.170 | 0.158
## 10 | 0.741 0.686 0.295 | -0.479 0.357 0.124 | 0.473
## ctr cos2
## 1 0.312 0.040 |
## 2 0.685 0.224 |
## 3 0.001 0.001 |
## 4 0.122 0.085 |
## 5 0.001 0.001 |
## 6 0.043 0.010 |
## 7 0.001 0.001 |
## 8 1.006 0.359 |
## 9 0.049 0.013 |
## 10 0.440 0.120 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.437 2.207 0.063 4.325 | -0.046 0.030
## Earl Grey | -0.225 1.527 0.091 -5.227 | -0.128 0.611
## green | 0.336 0.583 0.014 2.044 | 0.849 4.626
## alone | -0.020 0.012 0.001 -0.471 | 0.155 0.913
## lemon | 0.525 1.422 0.034 3.193 | -0.185 0.221
## milk | -0.274 0.737 0.020 -2.440 | -0.117 0.169
## other | 0.423 0.251 0.006 1.286 | -1.861 6.064
## tea bag | -0.585 9.087 0.448 -11.569 | 0.287 2.731
## tea bag+unpackaged | 0.382 2.144 0.067 4.465 | -0.958 16.767
## unpackaged | 1.765 17.507 0.425 11.269 | 1.143 9.154
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.001 -0.454 | 1.172 25.002 0.450 11.596 |
## Earl Grey 0.029 -2.963 | -0.478 10.848 0.412 -11.101 |
## green 0.089 5.161 | 0.168 0.228 0.003 1.019 |
## alone 0.045 3.656 | -0.090 0.384 0.015 -2.110 |
## lemon 0.004 -1.127 | -0.955 7.403 0.113 -5.805 |
## milk 0.004 -1.046 | 0.524 4.259 0.073 4.674 |
## other 0.107 -5.660 | 1.771 6.948 0.097 5.387 |
## tea bag 0.108 5.683 | 0.128 0.688 0.022 2.537 |
## tea bag+unpackaged 0.419 -11.186 | -0.176 0.716 0.014 -2.055 |
## unpackaged 0.178 7.301 | -0.146 0.190 0.003 -0.935 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.092 0.090 0.489 |
## How | 0.052 0.126 0.257 |
## how | 0.613 0.491 0.022 |
## where | 0.704 0.652 0.027 |
## sugar | 0.057 0.011 0.206 |
## price | 0.602 0.311 0.066 |
## relaxing | 0.012 0.000 0.040 |
## home | 0.002 0.032 0.248 |
As we reduced the number of variables, there are now only 17 dimensions. The eigenvalues have not increased in absolute terms, but their proportions have. Also, the squared correlations show that the variable where has a rather strong correlation with dimensions 1 (0.704) and 2 (0.652).
Now let us draw the biplot again using only the selected variables along with their values. It looks a lot more intelligible than the previous one.
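The reduced analysis and its biplot can be sketched like this; the column names come from the tea data, while the object names (tea_habit, mca2) are my own:

```r
library(FactoMineR)
data(tea)

# keep only the eight variables selected above
keep <- c("Tea", "How", "how", "where", "sugar", "price", "relaxing", "home")
tea_habit <- tea[, keep]

mca2 <- MCA(tea_habit, graph = FALSE)

# biplot of the variable categories; hiding the individuals ("ind") and
# colouring the categories by variable keeps the plot readable
plot(mca2, invisible = "ind", habillage = "quali")
```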
So what can we say based on this graph? Well, for one thing, it looks like drinking tea at home goes together with Earl Grey (or black tea) with milk, most likely bought from a chain store, while the type of tea changes to green once outside the home. So, tea consumption at home could be described as a low-key, ordinary activity in which unpackaged tea varieties sold in tea shops are unlikely to play a part.
Go to: top of this section or top of diary.